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Abstract. Transcriptional interactions in a cell are modulated by a va- 
riety of mechanisms that prevent their representation as pure pairwise 
interactions between a transcription factor and its target (s). These in- 
clude, among others, transcription factor activation by phosphorylation 
and acetylation, formation of active complexes with one or more co- 
factors, and mRNA/protein degradation and stabilization processes. 
This paper presents a first step towards the systematic, genome-wide 
computational inference of genes that modulate the interactions of spe- 
cific transcription factors at the post-transcriptional level. The method 
uses a statistical test based on changes in the mutual information be- 
tween a transcription factor and each of its candidate targets, conditional 
on the expression of a third gene. The approach was first validated on a 
synthetic network model, and then tested in the context of a mammalian 
cellular system. By analyzing 254 microarray expression profiles of nor- 
mal and tumor related human B lymphocytes, we investigated the post 
transcriptional modulators of the MYC proto-oncogene, an important 
transcription factor involved in tumorigenesis. Our method discovered 
a set of 100 putative modulator genes, responsible for modulating 205 
regulatory relationships between MYC and its targets. The set is signifi- 
cantly enriched in molecules with function consistent with their activities 
as modulators of cellular interactions, recapitulates established MYC reg- 
ulation pathways, and provides a notable repertoire of novel regulators 
of MYC function. The approach has broad applicability and can be used 
to discover modulators of any other transcription factor, provided that 
adequate expression profile data are available. 

1 INTRODUCTION 

The reverse engineering of cellular networks in prokaryotes and lower eukary- 
otes [1,2], as well as in more complex organisms, including mammals [3-5], is 
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Fig. 1: Synthetic network model of transistor-like regulatory logic, (a) Transistor model, 
(b) Schematic representation of the synthetic network, (c) Unconditional MI, condi- 
tional MI and conditional MI difference for the TF interactions conditioning on the 
expression level of the Ki and the CF; entries colored in red are determined to be 
statistically significant. 



unraveling the remarkable complexity of cellular interaction networks. In par- 
ticular, the analysis of targets of specific transcription factors (TF) reveals that 
target regulation can change substantially as a function of key modulator genes, 
including transcription co-factors and molecules capable of post-transcriptional 
modifications, such as phosphorylation, acetylation, and degradation. The yeast 
transcription factor STE12 is an obvious example, as it binds to distinct target 
genes depending on the co-binding of a second transcription factor, TEC1, as well 
as on the differential regulation by MAP kinases FUS3 and KSS1 [6]. Although 
the conditional, dynamic nature of cellular interactions was recently studied in 
yeast [7-10], methods to identify a genome- wide repertoire of the modulators of 
a specific transcription factor are still lacking. 

In this paper, we explore a particular type of "transistor like" logic, shown in 
Fig. la, where the ability of a transcription factor g TF (emitter) to regulate a tar- 
get gene gt (collector) is modulated by a third gene g m (base), which we shall call 
a modulator. Pairwise analysis of mRNA expression profiles will generally fail to 
reveal this complex picture because g m and g TF (e.g., a kinase and a transcrip- 
tion factor it activates) are generally statistically independent and because the 
correlation between the expression of g TF and gt is averaged over an entire range 



of values of g m and thus significantly reduced. However, we show that by con- 
ditioning on the expression of the modulator gene (e.g., an activating kinase), a 
statistically significant change in the g TF <-> gt correlation can be measured, thus 
directly identifying key post-transcriptional regulation mechanisms, including 
modifications by signaling molecules, co-factor binding, chromatin accessibility, 
modulation of protein degradation, etc. An important clement of this analysis 
is that, while signaling proteins are conventionally viewed as constitutively ex- 
pressed, rather than transcriptionally modulated, in practice their abundance 
within a cell population is subject to fluctuations (either functional or stochas- 
tic) . Depending on the number of available microarray expression profiles and on 
the range of fluctuation, this may be sufficient to establish a g TF «-> g t statistical 
dependency, conditional on the availability of one or more signaling molecules. 

We validated the approach on a simple synthetic network and then applied 
it to the identification of key modulators of MYC, an important TF involved 
in tumorigencsis of a variety of lymphomas. We identify a set of fOO putative 
modulators, which is significantly enriched in genes that play an obvious post- 
transcriptional or post-translational modulation role, including kinases, acyl- 
transferases, transcription factors, ubiquitination and mRNA editing enzymes, 
etc. Overall, this paper introduces the first genome- wide computational approach 
to identify genes that modulate the interaction between a TF and its targets. We 
find that the method recapitulates a variety of known mechanisms of modulation 
of the selected TF and identifies new interesting targets for further biochemical 
validation. 

2 METHOD 

As discussed in [11,12], the probability distribution of the expression state of 
an interaction network can be written as a product of functions of the indi- 
vidual genes, their pairs, and higher order combinations. Most reverse engi- 
neering techniques are either based on pairwise statistics [5,11,13], thus failing 
to reveal third and higher order interactions, or attempt to address the full 
dependency model [14], making the problem computationally untractable and 
under-sampled. Given these limitations, in this paper we address a much more 
modest task of identifying the "transistor-like" modulation of specific regulatory 
interaction, a specific type of third order interactions that is biologically impor- 
tant and computationally tractable in a mammalian context. Furthermore, given 
the relatively high availability of microarray expression profile data, we restrict 
our analysis to only genes that modulate transcriptional interactions, i.e., a TF 
regulating the expression of its target gene(s). 

In our model, just like in an analog transistor where the voltage on the 
base modulates the current between the other terminals, the expression state 
of the modulator, g m , controls the statistical dependence between g TF and gt, 
which may range from statistically independent to strongly correlated. If one 
chooses mutual information (MI) to measure the interaction strength (see [11] 
for the rationale), then the monotonic dependence of I{g T F, 9t\9m) on 9mi or l ac k 



thereof, can reveal respectively the presence or the absence of such a transistor- 
like interaction. 

Analysis along the lines of [11] indicates that currently available expression 
profile sets are too small to reliably estimate I(g T F, 9t\9m) as a function of g m . To 
reduce the data requirements, one can discretize g m into well sampled ranges g l m . 
Then. \I(g TF , gtWm) — I{9tf, StlSm) > (at the desired statistical significance 
level) for any range pair (11,12) is a sufficient condition for the existence of the 
transistor logic, either direct (i.e., g m is causally associated with the modulation 
of the TF targets) or indirect (i.e., g m is co expressed with a true modulator 
gene). Below we present details of an algorithm that, given a TF, explores all 
other gene pairs (g m , gt) in the expression profile to identify the presence of the 
transistor logic between the three genes. 

2.1 Selection of candidate modulator genes 

Given a expression profile dataset with N genes and an a-priori selected TF gene 
g TF , an initial pool of candidate modulators g m , {m} G 1,2,..., M, is selected 
from the N genes according to two criteria: (a) each g m must have sufficient 
expression range to determine statistical dependencies, (b) genes that are not 
statistically independent of g TF (based on MI analysis) are excluded. The latter 
avoids reducing the dynamic range of g TF due to conditioning on g m , which 
would unnecessarily complicate the analysis of significance of the conditional MI 
change. It also removes genes that transcriptionally interact with g TF , which can 
be easily detected by pair-wise co-expression analysis, and thus are not the focus 
of this work. We don't expect this condition to substantially increase the false 
negative rate. In fact, it is reasonable to expect that the expression of a post- 
transcriptional modulator of a TF function should be statistically independent 
of the TF's expression. For instance, this holds true for many known modulators 
of MYC function (including MAX, JNK, GSK, and NFkB). 

Each candidate modulator g m is then used to partition the expression pro- 
files into two equal-sized, non-overlapping subsets, and L~ , in which g m is 
respectively expressed at its highest (<;+) and lowest (<?„) levels. The conditional 
MI, 7 ± = I(g T F, Stiffs); is then measured as I(g TF ,gt) on the subset L^. Note 
that this partition is not intended to identify the over or under expression of the 
modulator, but rather to estimate g l m . Then, \I(g TF , gt\g^) — I(9tf, <7t|<7m)l > 
for target genes using the two tails of the modulator's expression range. The size 
of is constrained by the minimal number of samples required to accurately 
measure MI, as is discussed in [11]. Mutual information is estimated using an 
efficient Gaussian kernel method on rank-transformed data, and the accuracy of 
the measurement is known [11]. 

2.2 Conditional mutual information statistics 

Given a triplet (g m , 9tf> gt), we define the conditional MI difference as: 

AI(g TF ,g t \g m ) = \I + - 1~\ = \I{g TF , gt\gt) - I(9TF,gt\g~)\ (1) 



For simplicity, hereafter we use / for the unconditional MI (i.e., the MI across 
all samples) and AI for conditional MI difference. To assess the statistical sig- 
nificance of a AI value, we generate a null hypothesis by measuring its dis- 
tribution across 10 4 distinct (g TF ,gt) pairs with random conditions. That is, 
for each gene pair, the non-overlapping subsets used to measure 1^ and 
AI are generated at random rather than based on the expression of a candi- 
date modulator gene (1000 AI from random sub-samples are generated for each 
gene pair). Since the statistics of AI should depend on /, we binned / into 100 
equiprobable bins, resulting in 100 gene pairs and 10 5 AI measurements per bin. 
Within each bin, we model the distribution of AI as an extended exponential, 
p(AI) ~ exp(~aAI n + (3). which allows us to extrapolate the probability of a 
given AI under this null hypothesis model. As shown in Fig. 2, both the mean 
and the standard deviation of AI increase monotonically with / (as expected) 
and the extended exponentials produce an excellent fit for all bins. Specifically, 
for small /, the exponent of the fitted exponential distribution is n « 1. This is 
because in this case both J + and I~ are close to zero and AI is dominated by the 
estimation error, which falls off exponentially [11]. For large /, the estimation 
error becomes smaller than the true mutual information difference between the 
two random sub-samples, hence n w 2 from the central limit theorem. 



2.3 Interaction-specific modulator discovery 

Given a TF, g TF , and a set of candidate modulators g m selected as previously 
discussed, wc compute I(g TF ,gt) and AI(g TF , g t \g m ) for all genes g t in the ex- 
pression profile such that g t ^ g m and g t ^ g TF . Significance of each AI is then 
evaluated as a function of /, using the extended exponentials from our null hy- 
pothesis model. Gene pairs with a statistically significant p- value (p < 0.05), 
after Bonfcrroni correction for multiple hypothesis testing, are retained for fur- 
ther analysis. 

Significant pairs are further pruned if the interaction between g TF and gt is 
inferred as an indirect one in both conditions g^, based on the ARACNE [5, 
11] analysis on the two subsets L^j. This is accomplished by using the Data 
Processing Inequality (DPI), a well-know property of MI introduced in [5,11], 
which states that the interaction between g TF and g t is likely indirect (i.e. me- 
diated through a third gene g x ), if I(g TF ,g t ) < min[I(g TP ,g x ),I(g t ,g x )]. This 
step eliminates some specific cases, illustrated in Fig. 3, where g m can produce 
a significant AI even though it does not directly affect the g TF <-> gt interaction. 
Briefly, two cases will be addressed by the use of the DPI: (a) g m affects the 
g TF «-> g x interaction instead of g TF «-» gt (Fig. 3b); and (b) g m modulates g x , 
therefore affecting the g x <-> g t interaction instead of the g TF «-» g t . Thus g m is 
not a modulator of the g TF gene and should be removed (Fig. 3c). As discussed 
in [5, 11], the DPI was applied with a 15% tolerance to minimize the impact of 
potential MI estimation errors. 
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Fig. 2: Null distribution for the AI statistics, (a) Mean (/i) and standard deviation 
(a) of the AI statistics in each bin as a function of /. (b) - (d), distribution of the AI 
statistics (blue curves) and the extended exponential function, p(AI) = exp(—aAI n + 
P) (red curves), obtained by least square fitting in bin 25, 50 and 75; a goodness-of-fit 
measure, R 2 , and the value of n are also shown for each bin. 

3 RESULTS 

3.1 Synthetic Model 

We first tested our approach on a simple synthetic network (Fig. lb) that explic- 
itly models two post-translational modifications (activation by phosphorylation 
and by co-factor binding) that modulate the ability of a TF to affect its targets. 
The synthetic network includes a TF, a protein kinase (Ki) that phosphorylates 
the TF, a co-factor (CF) that can bind to TF forming a transcriptionally active 
complex, and three downstream targets of the TF's isoforms. The transcription 
activation/inhibition was modeled using Hill kinetics with exponential decay 
of mRNA molecules. Phosphorylation and cofactor binding were modeled us- 
ing Michaelis Mcnten and mass-action kinetics respectively (see Supplementary 
Table 1 for kinetic equations). 




(c) 

Fig. 3: Schematic diagram of the effect of DPf on eliminating indirect regulatory rela- 
tionships, (a) Correct modulation model where the modulator (M) significantly changes 
the regulatory relationship between the TF (Hub) and its direct targets (Gi and G2). 
(b) Removal of indirect connections to the hub eliminates the detection of a significant 
AI on indirect targets, (c) Modulators that affect the downstream target of the TF 
hub, thus causing significant AI between the TF and its indirect neighbors, will be 
removed by applying DPI. 

A set of 250 synthetic expression profiles was generated from this model us- 
ing Gepasi (ver 3.30) [15] by (a) randomly sampling the independent variables 
(concentration of mRNA for the TF, Ki, and CF) from a uniform distribution, 
so that they were statistical independent (b) simulating network dynamics until 
a steady state was reached, and (c) measuring the concentration of all mRNA 
species that were explicitly represented in the network (using a Gaussian ex- 
perimental noise model with mean and standard deviation equal to 10% of 
the mean concentration for each variable). Note that only the mRNA concen- 
trations were used as inputs to our algorithm, even though all molecular species 
(including all proteins isoforms) were explicitly represented in the model. By 
conditioning on the expression of the Ki and CF, using the 40% of expression 
profiles with their most and least expressed values, our approach correctly iden- 
tified the two and only two significant AI, associated with the pairs (CF, 172) 
and (Ki, (73), as shown in Fig. lc. 

3.2 Analysis of Human B lymphocyte data 

We then used our method to identify a genome-wide repertoire of post-transcrip- 
tional modulators of the MYC proto-oncogene - a TF which represents a ma- 



jor hub in the B cell transcriptional network [5]. The analysis was performed 
on a collection of 254 gene expression profiles, representing 27 distinct cellu- 
lar phenotypes derived from populations of normal and neoplastic human B 
lymphocytes. The gene expression profiles were collected using the Affymctrix 
HG-U95A GeneChip® System (approximately 12,600 probes). Probes with ab- 
solute expression mean fi < 50 and standard deviation a < 0.3/x, were considered 
non-informative and were excluded a-priori from the analysis, leaving 7907 genes. 

We further selected 1117 candidate modulators with sufficient expression 
range (/x > 200 and a > 0.5/x) that were statistically independent of MYC based 
on MI (significance was established as in [11]). The top 40% and bottom 40% 
of the expression profiles in which a candidate modulator g m is expressed at its 
highest and lowest levels, respectively, were used to define the two conditional 
subsets L^j. The choice of the 40% threshold was specific to this dataset. It 
ensured that > 100 samples were available within each conditional subset for 
estimating MI with a reasonable accuracy [11], while keeping the modulators' 
expression range within the two subsets as separated as possible. 

The analysis inferred a repertoire of 100 genes, at a 5% statistical significance 
level (Bonferroni corrected), which are responsible for modulating 205 regulatory 
relationships between MYC and its 130 inferred targets in an interaction-specific 
fashion. See Supplementary Fig. 1 for a map of the modulators and the affected 
interactions. A complete list is available in the Supplementary Table 2. 



3.3 Gene Ontology enrichment analysis 

To analyze the biological significance of these putative modulator genes, we 
studied the enrichment of the Gene Ontology [16] Molecular Function categories 
among the 100 modulators compared to the initial list of 1117 candidate modula- 
tors. As shown in Table 1, the top enriched categories represent functions consis- 
tent with their activities as modulators of cellular interactions. In particular, pu- 
tative modulators were enriched in kinases (PKN2, MAP4K4, BRD2, CSNK1D, 
HCK, LCK, TRIB2, BRD2 and MARCKS), acyltransferase (GGT1, SAT and 
TGM2) and transcriptional regulators (CUTL1, SSBP2, MEF2B, ID3, AF4, 
BHLHB2, CREM, E2F5, MAX, NR4A1, CBFA2T3, REL, FOS and NFKB2). 
This is in agreement with the established evidence that MYC is modulated 
through phosphorylation and acetylation events, affecting its protein stabil- 
ity [17, 18], and that MYC requires broadly distributed effector proteins to influ- 
ence its genomic targets [19]. We also found that 4 of the 6 modulators with the 
largest number of affected targets (e.g. UBE2G1, HCK, USP6 and IFNGR1), 
are associated with non-target-specific functions (e.g. protein degradation, up- 
stream signaling pathway components and receptor signaling molecules, etc). 
On the other hand, the 14 modulators that are transcription factors (and may 
thus be MYC co-factors) tend to be highly interaction-specific, affecting only 
1-4 target genes (see Supplementary Fig. 1). 



Table 1: Most enriched Gene Ontology Molecular Function categories for the inferred 
MYC modulators. False discovery rate (FDR) are calculated from Fisher's exact test 
and adjusted for multiple hypothesis testing. Only categories with at least 5 genes from 
the initial 1117 candidate modulators were used. 



Gene Ontology Molecular Function Categories 


Enrichment FDR 


DNA binding 


0.007 


Transferase activity 


0.010 


Acyltransferase activity 


0.010 


Antioxidant activity 


0.018 


Phosphoric monoester hydrolase activity 


0.026 


Adenyl nucleotide binding 


0.028 


Transcription regulator activity 


0.052 


Protein serine/threonine kinase activity 


0.066 



3.4 Literature validation of known MYC modulators 

Closer scrutiny, through literature review reveals that a number of the inferred 
modulators play a role in the post-transcriptional and post-translational mod- 
ulation of MYC, either by direct physical interaction, or by modulating well- 
characterized pathways that are known to affect MYC function. 

Among the list of putative modulators, we found two well known co- factors 
of MYC: MAX and MIZ-1. Numerous studies, [20] among many others, have 
shown that transcriptional activation by MYC occurs via dimerization with its 
partner MAX. Similarly, MIZ-1 has been shown to specifically interact with 
MYC through binding to its helix-loop-helix domain, which may be involved in 
gene repression by MYC [21]. Several protein kinases identified by our method 
are also notable: CSNK1D, a member of the Casein Kinase I gene family, is a 
reasonable MYC modulator since one of its related family member, Casein Ki- 
nase II, has been demonstrated to phosphorylate MYC and MAX, thus affecting 
the DNA-binding kinetics of their heterodimer [22, 23]. MYC is also know to be 
phosphorylated by JNK [24] and GSK [25], which affect the stability of its pro- 
tein. Although both kinases were excluded from our initial candidate modulator 
set due to their insufficient expression range, our approach was able to identify 
some of their upstream signaling molecules, such as MAP4K4 and HCK. Both 
MAP4K4 and HCK are members of the BCR singling pathway that is known to 
control MYC activation and degradation [26] . In particular, MAP4K4 has been 
previously reported to specifically activate JNK [27]. 

MYC stability is also known to be regulated through ubiquitin-mediated pro- 
teolysis [28]. Two enzymes in this process, USP6 and UBE2G1, were identified as 
putative modulators of MYC. Although there is no biochemical evidence impli- 
cating these two proteins specifically, they serve as a reasonable starting point for 
biochemical validation. We also identified putative modulators that could poten- 
tially influence the MYC mRNA stability. One of them, APOBEC3B, is closely 
related to APOBEC1, which has been well characterized as a RNA-editing en- 
zyme capable of binding MYC mRNA in the 3' untranslated region, thus increas- 
ing its stability [29]. While APOBEC1 was excluded from our analysis due to its 



insufficient expression range, the identification of its closely related family mem- 
ber, APOBEC3B, may suggest a similar mechanism. Another protein from this 
category, HNRPDL, encodes a heterogeneous nuclear ribonucleoprotein which 
interacts with mRNA and may have a role in mRNA nuclear export. 

MYC stimulates gene expression in part at the level of chromatin, through its 
association with co-factors that affect the histone acetylation and DNA methyla- 
tion. DNMT1, which encodes a DNA methyltransferase, was found in our puta- 
tive modulator list. Current literature suggests that MYC may repress transcrip- 
tion through the recruitment of DNA methyltransferase as corepressor, which 
may in turn lead to hypoacctylatcd histones that arc often associated with tran- 
scriptional silencing [30,31]. 

Many other genes in our list of putative modulators of MYC also present 
relevant biological functionality, such as transcription factors FOS, CREM, REL 
and NFKB2, anti-apoptosis regulator BCL-2, to name but a few. Those for which 
functional relevance can not be established from the current literature likely 
belong to two groups: (a) novel bona fide MYC modulators requiring further 
biochemical validation and (b) genes that are co-expressed with a bona fide 
modulator, such as gene from the same biological pathway. A likely example of 
the latter case is NFKB2 and its inhibitor NFKBIA, which are both identified 
as modulators of MYC, while having substantially correlated expression profiles 
(Pearson correlation 0.55). 

3.5 Transcription factors co-binding analysis 

For the putative modulators annotated as TF, one potential mechanism of mod- 
ulation is as MYC co-factors. We thus searched for the binding signatures of both 
MYC and the modulator-TF within the promoter region of the genes whose in- 
teraction with MYC appeared to be modulated by the TF. Of the 14 TFs in our 
putative modulator list, 8 have credibly identified DNA binding signatures from 
TRANSFAC [32] (represented as position-specific scoring matrix). These TFs 
affect 12 MYC interactions with 11 target genes. Additionally, 4 of the 5 target 
genes whose expressions are positively correlated with MYC present at least one 
E-Box in their promoter region (p < 4.03 x 10" 4 ) 4 . As is shown in Table 2, 
of the 12 instances of statistically significant modulator target pairs, 7 target 
genes harbor at least one high specificity TF binding signature (Pbs < 0.05) in 
their promoter region. The overall p-value associated with this set of events is 
p < 0.0025 (from the binomial background model). This strongly supports the 
hypothesis that these TFs are target specific co-factors of MYC. 

4 CONCLUSION AND DISCUSSION 

Cellular interactions can be neither represented as a static relationship nor mod- 
eled as pure pairwise processes. The two issues are deeply interlinked as higher 

4 MYC is known to transcriptional activate its targets through binding to E-box ele- 
ments. Repression by MYC occurs via a distinct mechanism, not involving E-boxes, 
which is not yet well characterized. 



Table 2: Promoter analysis of the MYC target genes affected by TF-modulators. Bind- 
ing signatures of 8 of the 14 TFs in the putative modulator list were obtained through 
TRANSFAC. Promoter sequences of the target genes (2Kb upstream and 2Kb down- 
stream of the transcription initiation site) were retrieved from the UCSC Golden Path 
database [33] and masked for repetitive elements. Statistical significance is assessed by 
considering a null score distribution computed from random sequences using an order-2 
Markov model learned from the actual promoter sequences, where Pbs is calculated 
as the probability of finding at least one binding site per 1Kb sequence under the null 
hypothesis. We used a significance threshold of 0.05; findings below this threshold are 
shown in red. 



Modulator 


Binding Signature 


MYC targets 


Pbs 


CUTL1 




NP 


0.032 




CREM 




PRKDC 


0.041 








TLE4 


0.030 


BHLHB2 




MLL 


0.039 






IL4R 


0.140 




ctatttatag 


KLF12 


0.391 


MEF2B 


CR2 


0.326 






CYBB 


0.045 


FOS 




ZNF259 


0.049 




REL 




KEL 


0.125 




E2F5 




IL4R 


0.870 




NFKB2 




FOXK2 


0.041 



order interactions are responsible for the rewiring of the cellular network in a 
context dependent fashion. For transcriptional interactions, one can imagine a 
transistor-like model, in which the ability of a TF to activate or repress the 
expression of its target genes is modulated, possibly in a target-specific way, by 
one or more signaling proteins or co-factors. Such post-transcriptional and post- 
translational conditional interactions are necessary to create complex rather than 
purely reactive cellular behavior and should be abundant in biological systems. 

Unfortunately, most post-translational interactions (e.g., phosphorylation or 
complex formation) do not affect the mRNA concentration of the associated 
proteins. As a result, they are invisible to naive co-expression analysis meth- 
ods. However, proteins that are involved in post-translational regulation may 
be themselves transcriptionally regulated. At steady state, the concentration of 
such post-translationally modified proteins and complexes can then be expressed 
as a function of some mRNA expressions, albeit in a non-obvious, conditional 
fashion. With this in mind, we show that conditional analysis of pairwise statis- 
tical dependencies between mRNAs can effectively reveal a variety of transient 
interactions, as well as their post-transcriptional and post-translational regula- 
tions. 



In this paper, we restrict our search to genes that affect the abiiity of a given 
TF to transcriptionally activate or repress its target (s). While the identification 
of the targets of a transcription factor is a rather transited area, the identification 
of upstream modulators is essentially unaddressed at the computational level, 
especially in a TF-target interaction specific way. Experimentally, it constitutes 
an extremely complex endeavor that is not yet amenable to high-throughput 
approaches. For instance, while hundreds of MYC targets are known, only a 
handful of genes have been identified that are capable of modifying MYC's ability 
to activate or repress its targets. Even fewer of these are target specific. 

We show that such modulator genes can be accurately and efficiently iden- 
tified from a set of candidates using a conditional MI difference metric. One 
novelty of our approach is that it requires no a-priori selection of the modulator 
genes based on certain functional criteria: the candidate modulators include all 
genes on the microarray that have sufficient dynamic range and no significant MI 
with the TF gene. The first requirement can be actually lifted without affecting 
the method other than making it more computationally intensive and requiring 
more stringent statistical tests (as the number of tested hypotheses would obvi- 
ously increases) . This is because conditioning on a gene with an expression range 
comparable to the noise is equivalent to random sub-sampling of the expression 
profiles, an event that will be filtered out by our statistical test. 

Another critical element of our method is the phenotypic heterogeneity of the 
expression profiles. This ensures that no sufficiently large subset of the expres- 
sion profiles can be obtained without sampling from a large number of distinct 
phenotypes, including both normal and malignant cells. In fact, the average 
number of distinct cellular phenotypes in any subset of gene expression profiles 
used in the analysis is about 20, with no subset containing fewer than 13. Thus, 
the modulators identified in this paper are not associated with a specific cellular 
phenotypc. 

To derive a null model for estimating the significance of individual condi- 
tional mutual information differences, AI, we investigated the statistics of AI 
as a function of the unconditional mutual information /. Other models, such as 
dependence of AI on I + or I~ , were also investigated, but they proved to be 
less informative. It is possible that a more accurate null model may be learned 
by studying the variation of AI as a function of both / and either / + or I~ . 
For example, this may answer questions such as: given measured values of / and 
I~ , what is the probability of seeing a specific difference in All While this may 
provide finer-grained estimates of the statistical significance, this also would dra- 
matically increases the number of Monte Carlo samples necessary for achieving 
a reasonable numerical precision, which would prohibit the actual deployment 
of the strategy. 

Method limitations follow broadly into three categories: (a) The computa- 
tional deconvolution of molecular interaction is still manifestly inaccurate. This 
has obvious effects on the discovery of interaction-specific modulators, i.e. we 
may identify modulators of "functional" rather than physical interactions, (b) 
The method cannot dissect modulators that are constitutively expressed (house- 



keeping genes) and activated only at the post-translational level (e.g., the p53 
tumor suppressor gene), nor modulators that are expressed at very low concen- 
trations. However, in both cases a gene upstream of the most direct modulator 
may be identified in its place. For instance, JNK is a known modulator of MYC 
activity, which is weakly expressed in human B cells and, therefore it is not even 
included in the initial candidate modulator list. However, MAP4K4, which is 
upstream of JNK in the signaling cascade, is identified as a MYC modulator 
in its place, (c) The method cannot disambiguate true modulators from those 
co-expressed with them. 

Techniques to deal with all these drawbacks are currently being investigated. 
However, we believe that, even in its current state, our approach presents a 
substantial advancement in the field of reverse engineering of complex cellular 
networks. 

5 SUPPLEMENTARY MATERIAL 

Supplementary Materials are available at: http://www.dbmi.columbia.edu/ 
~kaw7002/recomb06/ supplement .html. 
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